Abstract

In this paper, we address the problem of learning the structure of a pairwise graphical
model from samples in a high-dimensional setting. Our first main result studies the
sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward
greedy algorithm as applied to general statistical models. As a special case, we then apply
this algorithm to learn the structure of a discrete graphical model via neighborhood
estimation. As a corollary of our general result, we derive sufficient conditions on the
number of samples $n$, the maximum node degree $d$ and the problem size $p$, as well as other
conditions on the model parameters, so that the algorithm recovers all the edges with high
probability. Our result guarantees graph selection for samples scaling as
$n = \Omega(d^2 \log(p))$, in contrast to existing convex-optimization based algorithms that
require a sample complexity of $\Omega(d^3 \log(p))$. Further, the greedy algorithm only
requires a restricted strong convexity condition, which is typically milder than
irrepresentability assumptions. We corroborate these results using numerical simulations at
the end.

1 Introduction 

Undirected graphical models, also known as Markov random fields, are used in a variety of
domains, including statistical physics, natural language processing and image analysis among
others. In this paper we are concerned with the task of estimating the graph structure $G$ of
a Markov random field (MRF) over a discrete random vector $X = (X_1, X_2, \ldots, X_p)$, given
$n$ independent and identically distributed samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$.
This underlying graph structure encodes conditional independence assumptions among subsets of
the variables, and thus plays an important role in a broad range of applications of MRFs.

Existing approaches: Neighborhood Estimation, Greedy Local Search. Methods for estimating
such graph structure include those based on constraint and hypothesis testing [22], and those
that estimate restricted classes of graph structures such as trees [8], polytrees [11], and
hypertrees [23]. A recent class of successful approaches for graphical model structure
learning is based on estimating the local neighborhood of each node. One subclass of these,
for the special case of bounded degree graphs, involves the use of exhaustive search, so that
its computational complexity grows at least as quickly as $O(p^d)$, where $d$ is the maximum
neighborhood size in the graphical model [1, 4, 9]. Another subclass uses convex programs to
learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for
each vertex $r \in V$ by optimizing its $\ell_1$-regularized conditional likelihood; [15, 10]
use $\ell_1/\ell_2$-regularized conditional likelihood. Even these methods, however, need to
solve regularized convex programs with typically polynomial computational cost of $O(p^4)$ or
$O(p^6)$, and are still expensive for large problems. Another popular class of approaches is
based on using a score metric and searching for the best scoring structure from a candidate
set of graph structures. Exact search is typically NP-hard [7]; indeed for general discrete
MRFs, not only is the search space intractably large, but calculation of typical score
metrics itself is computationally intractable since they involve computing the partition
function associated with the Markov random field [26]. Such methods thus have to use
approximations and search heuristics for tractable computation. Question: Can one use local
procedures that are as inexpensive as the heuristic greedy approaches, and yet come with the
strong statistical guarantees of the regularized convex program based approaches?

High-dimensional Estimation; Greedy Methods. There has been an increasing focus in recent
years on high-dimensional statistical models where the number of parameters $p$ is comparable
to or even larger than the number of observations $n$. It is now well understood that
consistent estimation is possible even under such high-dimensional scaling if some
low-dimensional structure is imposed on the model space. Of relevance to graphical model
structure learning is the structure of sparsity, where a sparse set of non-zero parameters
entails a sparse set of edges. A surge of recent work [5, 12] has shown that
$\ell_1$-regularization for learning such sparse models can lead to practical algorithms with
strong theoretical guarantees. A line of recent work (cf. paragraph above) has thus leveraged
this sparsity inducing nature of $\ell_1$-regularization to propose and analyze convex
programs based on regularized log-likelihood functions. A related line of recent work on
learning sparse models has focused on "stagewise" greedy algorithms. These perform simple
forward steps (adding parameters greedily), and possibly also backward steps (removing
parameters greedily), and yet provide strong statistical guarantees for the estimate after a
finite number of greedy steps. The forward greedy variant, which performs just the forward
step, has appeared in various guises in multiple communities: in machine learning as
boosting [13], in function approximation [24], and in signal processing as basis pursuit [6].
In the context of statistical model estimation, Zhang [28] analyzed the forward greedy
algorithm for the case of sparse linear regression, and showed that the forward greedy
algorithm is sparsistent (consistent for model selection recovery) under the same
"irrepresentable" condition as that required for "sparsistency" of the Lasso. Zhang [27]
analyzes a more general greedy algorithm for sparse linear regression that performs forward
and backward steps, and shows that it is sparsistent under a weaker restricted eigenvalue
condition. Here we ask the question: Can we provide an analysis of a general forward backward
algorithm for parameter estimation in general statistical models? Specifically, we need to
extend the sparsistency analysis of [28] to general non-linear models, which requires a
subtler analysis due to a circular requirement: one needs to control the third-order terms in
the Taylor series expansion of the log-likelihood, which in turn requires the estimate to be
well-behaved. Such extensions in the case of $\ell_1$-regularization occur for instance in
[20, 25, 3].

Our Contributions. In this paper, we address both questions above. In the first part, we
analyze the forward backward greedy algorithm [28] for general statistical models. We note
that even though we consider the general statistical model case, our analysis is much simpler
and more accessible than [28], and would be of use even to a reader interested in just the
linear model case of Zhang [28]. In the second part, we use this to show that when combined
with neighborhood estimation, the forward backward variant applied to local conditional
log-likelihoods provides a simple, computationally tractable method that adds and deletes
edges, but comes with strong sparsistency guarantees. We reiterate that our first result, on
the sparsistency of the forward backward greedy algorithm for general objectives, is of
independent interest even outside the context of graphical models. As we show, the greedy
method is better than the $\ell_1$-regularized counterpart in [20] theoretically, as well as
experimentally. The sufficient condition on the parameters imposed by the greedy algorithm is
a restricted strong convexity condition [19], which is weaker than the irrepresentable
condition required by [20]. Further, the number of samples required for sparsistent graph
recovery scales as $O(d^2 \log p)$, where $d$ is the maximum node degree, in contrast to
$O(d^3 \log p)$ for the $\ell_1$-regularized counterpart. We corroborate this in our
simulations, where we find that the greedy algorithm requires fewer observations than [20]
for sparsistent graph recovery.

2 Review, Setup and Notation 
2.1 Markov Random Fields

Let $X = (X_1, \ldots, X_p)$ be a random vector, each variable $X_i$ taking values in a
discrete set $\mathcal{X}$ of cardinality $m$. Let $G = (V, E)$ denote a graph with $p$ nodes,
corresponding to the $p$ variables $\{X_1, \ldots, X_p\}$. A pairwise Markov random field over
$X = (X_1, \ldots, X_p)$ is then specified by nodewise and pairwise functions
$\theta_r : \mathcal{X} \mapsto \mathbb{R}$ for all $r \in V$, and
$\theta_{rt} : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ for all $(r,t) \in E$:

$$\mathbb{P}(x) \propto \exp\Big\{ \sum_{r \in V} \theta_r(x_r) + \sum_{(r,t) \in E} \theta_{rt}(x_r, x_t) \Big\}. \tag{1}$$

In this paper, we largely focus on the case where the variables are binary with
$\mathcal{X} = \{-1, +1\}$, where we can rewrite (1) to the Ising model form [14] for some set
of parameters $\{\theta_r\}$ and $\{\theta_{rt}\}$ as

$$\mathbb{P}(x) \propto \exp\Big\{ \sum_{r \in V} \theta_r x_r + \sum_{(r,t) \in E} \theta_{rt} x_r x_t \Big\}. \tag{2}$$
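As a concrete illustration of the Ising form (2), the following minimal sketch evaluates the exponent in (2) and normalises it by brute-force enumeration, which is only feasible for very small $p$. The function names and the convention of storing the couplings in a symmetric matrix `theta_edge` with zero diagonal are assumptions made for this example, not notation from the paper.

```python
import itertools
import numpy as np

def ising_log_potential(x, theta_node, theta_edge):
    """Exponent of Eq. (2): sum_r theta_r x_r + sum_{(r,t) in E} theta_rt x_r x_t.

    theta_edge is assumed symmetric with a zero diagonal; the upper triangle
    is used so that each edge (r, t) is counted exactly once.
    """
    x = np.asarray(x, dtype=float)
    return float(theta_node @ x + x @ np.triu(theta_edge, k=1) @ x)

def ising_distribution(theta_node, theta_edge):
    """Enumerate all 2^p states in {-1, +1}^p and normalise (small p only)."""
    p = len(theta_node)
    states = np.array(list(itertools.product([-1, 1], repeat=p)))
    log_w = np.array([ising_log_potential(s, theta_node, theta_edge) for s in states])
    weights = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return states, weights / weights.sum()
```

On a three-node chain, the resulting table can be used to check numerically that the two end nodes are conditionally independent given the middle node, which is the kind of statement the graph structure encodes.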



2.2 Graphical Model Selection 

Let $D := \{x^{(1)}, \ldots, x^{(n)}\}$ denote the set of $n$ samples, where each
$p$-dimensional vector $x^{(i)} \in \{1, \ldots, m\}^p$ is drawn i.i.d. from a distribution
$\mathbb{P}_{\theta^*}$ of the form (1), for parameters $\theta^*$ and graph $G = (V, E^*)$
over the $p$ variables. Note that the true edge set $E^*$ can also be expressed as a function
of the parameters as

$$E^* = \{(r,t) \in V \times V : \theta^*_{rt} \neq 0\}. \tag{3}$$

The graphical model selection task consists of inferring this edge set $E^*$ from the samples
$D$. The goal is to construct an estimator $\hat{E}_n$ for which
$\mathbb{P}[\hat{E}_n = E^*] \to 1$ as $n \to \infty$. Denote by $\mathcal{N}^*(r)$ the set of
neighbors of a vertex $r \in V$, so that $\mathcal{N}^*(r) = \{t : (r,t) \in E^*\}$. Then the
graphical model selection problem is equivalent to that of estimating the neighborhoods
$\hat{\mathcal{N}}_n(r) \subset V$, so that
$\mathbb{P}[\hat{\mathcal{N}}_n(r) = \mathcal{N}^*(r);\ \forall r \in V] \to 1$ as
$n \to \infty$.

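For any pair of random variables $X_r$ and $X_t$, the parameter $\theta_{rt}$ fully
characterizes whether there is an edge between them, and can be estimated via its conditional
likelihood. In particular, defining $\theta_r := (\theta_{r1}, \ldots, \theta_{rp})$, our goal
is to use the conditional likelihood of $X_r$ conditioned on $X_{V \setminus r}$ to estimate
$\theta_r$ and hence its neighborhood $\mathcal{N}(r)$. This conditional distribution of $X_r$
conditioned on $X_{V \setminus r}$ generated by (2) is given by the logistic model

$$\mathbb{P}\big(X_r = x_r \mid X_{V \setminus r} = x_{V \setminus r}\big) = \frac{\exp\big(\theta_r x_r + \sum_{t \in V \setminus r} \theta_{rt} x_r x_t\big)}{1 + \exp\big(\theta_r x_r + \sum_{t \in V \setminus r} \theta_{rt} x_r x_t\big)}.$$

Given the $n$ samples $D$, the corresponding conditional log-likelihood is given by

$$\mathcal{L}(\theta_r; D) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ \log\Big(1 + \exp\Big(\theta_r x_r^{(i)} + \sum_{t \in V \setminus r} \theta_{rt} x_r^{(i)} x_t^{(i)}\Big)\Big) - \theta_r x_r^{(i)} - \sum_{t \in V \setminus r} \theta_{rt} x_r^{(i)} x_t^{(i)} \Big\}. \tag{4}$$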
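The loss (4) is an ordinary logistic loss in the "margin" $z_i = \theta_r x_r^{(i)} + \sum_t \theta_{rt} x_r^{(i)} x_t^{(i)}$. A minimal sketch of the loss and its gradient is given below. It assumes, purely for illustration, that the samples are stored as an $n \times p$ array `X` with entries in $\{-1,+1\}$ and that the parameter vector `theta` stores the node term $\theta_r$ in position `r` and the couplings $\theta_{rt}$ in the remaining positions; the function names are hypothetical.

```python
import numpy as np

def _margins(theta, X, r):
    """Return (design, x_r, z) with z_i = theta_r x_r + sum_t theta_rt x_r x_t."""
    design = X.astype(float)       # astype returns a copy, X is left untouched
    x_r = design[:, r].copy()
    design[:, r] = 1.0             # column r carries the node term theta_r
    z = x_r * (design @ theta)
    return design, x_r, z

def node_conditional_loss(theta, X, r):
    """Eq. (4): average of log(1 + exp(z_i)) - z_i over the n samples."""
    _, _, z = _margins(theta, X, r)
    return float(np.mean(np.logaddexp(0.0, z) - z))

def node_conditional_grad(theta, X, r):
    """Gradient of Eq. (4) with respect to theta."""
    design, x_r, z = _margins(theta, X, r)
    # d/dz [log(1 + e^z) - z] = -1 / (1 + e^z), written in a stable form
    weights = -np.exp(-np.logaddexp(0.0, z))
    return (design * (weights * x_r)[:, None]).mean(axis=0)
```

At `theta = 0` every margin is zero, so the loss equals $\log 2$ regardless of the data, which gives a quick sanity check of an implementation.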

In Section 4, we study a greedy algorithm (Algorithm 2) that finds these node neighborhoods
$\hat{\mathcal{N}}_n(r) = \mathrm{Supp}(\hat{\theta}_r)$ of each random variable $X_r$
separately by a greedy stagewise optimization of the conditional log-likelihood of $X_r$
conditioned on $X_{V \setminus r}$. The algorithm then combines these neighborhoods to obtain
a graph estimate $\hat{E}$ using an "OR" rule:
$\hat{E}_n = \cup_r \{(r,t) : t \in \hat{\mathcal{N}}_n(r)\}$. Other rules such as the "AND"
rule, which adds an edge only if it occurs in each of the respective node neighborhoods, could
be used to combine the node neighborhoods into a graph estimate. We show in Theorem 2 that the
neighborhood selection by the greedy algorithm succeeds in recovering the exact node
neighborhoods with high probability, so that by a union bound, the graph estimates using
either the AND or OR rules would be exact with high probability as well.
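The two combination rules described above can be sketched as follows. The representation of the neighborhoods as a dictionary from node to a set of estimated neighbors, and the function name, are assumptions made for this example.

```python
def combine_neighborhoods(neighborhoods, rule="or"):
    """Combine per-node neighborhood estimates into an undirected edge set.

    neighborhoods: dict mapping each node r to the set of its estimated neighbors.
    rule: "or" keeps (r, t) if t is in N(r) or r is in N(t);
          "and" keeps (r, t) only if both hold.
    """
    if rule not in ("or", "and"):
        raise ValueError("rule must be 'or' or 'and'")
    edges = set()
    for r, nbrs in neighborhoods.items():
        for t in nbrs:
            if rule == "or" or r in neighborhoods.get(t, set()):
                edges.add((min(r, t), max(r, t)))  # store each edge once
    return edges
```

When every neighborhood is recovered exactly, the two rules return the same edge set, which is why the guarantee of Theorem 2 covers both.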

Before we describe this greedy algorithm and its analysis in Section 4 however, we 
first consider the general statistical model case in the next section. We first describe 
the forward backward greedy algorithm of Zhang [28] as applied to general statistical 
models, followed by a sparsistency analysis for this general case. We then specialize 
these general results in Section 4 to the graphical model case. The next section is thus 
of independent interest even outside the context of graphical models. 



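Algorithm 1 Greedy forward-backward algorithm for finding a sparse optimizer of $\mathcal{L}(\cdot)$

Input: Data $D := \{x^{(1)}, \ldots, x^{(n)}\}$, Stopping Threshold $\epsilon_S$, Backward Step Factor $\nu \in (0,1)$
Output: Sparse optimizer $\hat{\theta}$

$\hat{\theta}^{(0)} \leftarrow 0$ and $\hat{S}^{(0)} \leftarrow \emptyset$ and $k \leftarrow 1$
while true do {Forward Step}
    $(j_*, \alpha_*) \leftarrow \arg\min_{j \in (\hat{S}^{(k-1)})^c;\ \alpha} \mathcal{L}(\hat{\theta}^{(k-1)} + \alpha e_j; D)$
    $\hat{S}^{(k)} \leftarrow \hat{S}^{(k-1)} \cup \{j_*\}$
    $\delta_f^{(k)} \leftarrow \mathcal{L}(\hat{\theta}^{(k-1)}; D) - \mathcal{L}(\hat{\theta}^{(k-1)} + \alpha_* e_{j_*}; D)$
    if $\delta_f^{(k)} < \epsilon_S$ then
        break
    end if
    $\hat{\theta}^{(k)} \leftarrow \arg\min_{\theta} \mathcal{L}(\theta_{\hat{S}^{(k)}}; D)$
    $k \leftarrow k + 1$
    while true do {Backward Step}
        $j^* \leftarrow \arg\min_{j \in \hat{S}^{(k-1)}} \mathcal{L}(\hat{\theta}^{(k-1)} - \hat{\theta}_j^{(k-1)} e_j; D)$
        if $\mathcal{L}(\hat{\theta}^{(k-1)} - \hat{\theta}_{j^*}^{(k-1)} e_{j^*}; D) - \mathcal{L}(\hat{\theta}^{(k-1)}; D) > \nu \delta_f^{(k)}$ then
            break
        end if
        $\hat{S}^{(k-1)} \leftarrow \hat{S}^{(k-1)} - \{j^*\}$
        $\hat{\theta}^{(k-1)} \leftarrow \arg\min_{\theta} \mathcal{L}(\theta_{\hat{S}^{(k-1)}}; D)$
        $k \leftarrow k - 1$
    end while
end while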



3 Greedy Algorithm for General Losses 

Consider a random variable $Z$ with distribution $\mathbb{P}$, and let
$Z_1^n := \{Z_1, \ldots, Z_n\}$ denote $n$ observations drawn i.i.d. according to
$\mathbb{P}$. Suppose we are interested in estimating some parameter
$\theta^* \in \mathbb{R}^p$ of the distribution $\mathbb{P}$ that is sparse; denote its number
of non-zeroes by $s^* := \|\theta^*\|_0$. Let
$\mathcal{L} : \mathbb{R}^p \times \mathcal{Z}^n \mapsto \mathbb{R}$ be some loss function
that assigns a cost to any parameter $\theta \in \mathbb{R}^p$, for a given set of
observations $Z_1^n$. For ease of notation, in the sequel, we adopt the shorthand
$\mathcal{L}(\theta)$ for $\mathcal{L}(\theta; Z_1^n)$. We assume that $\theta^*$ satisfies
$\mathbb{E}[\nabla \mathcal{L}(\theta^*)] = 0$.

We now consider the forward backward greedy algorithm in Algorithm 1, which rewrites the
algorithm in [27] to allow for general loss functions. The algorithm starts with an empty set
of active variables $\hat{S}^{(0)}$ and gradually adds (and removes) variables to the active
set until it meets the stopping criterion. This algorithm has two major steps: the forward
step and the backward step. In the forward step, the algorithm finds the best next candidate
and adds it to the active set as long as it improves the loss function by at least
$\epsilon_S$; otherwise the stopping criterion is met and the algorithm terminates. Then, in
the backward step, the algorithm checks the influence of all variables in the presence of the
newly added variable. If one or more of the previously added variables do not contribute at
least $\nu \epsilon_S$ to the loss function, then the algorithm removes them from the active
set. This procedure ensures that at each round, the loss function is improved by at least
$(1 - \nu)\epsilon_S$ and hence it terminates within a finite number of steps.
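The forward and backward steps described above can be sketched in a few lines when every step has a closed form. The sketch below specialises the procedure to the squared loss $\mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$; this choice of loss, the function names and the `max_rounds` safeguard are assumptions made for illustration, whereas the algorithm in the text applies to general losses. It also assumes that no column of `X` is identically zero.

```python
import numpy as np

def _loss(X, y, theta):
    resid = y - X @ theta
    return 0.5 * float(resid @ resid) / len(y)

def _refit(X, y, support):
    """Least-squares refit restricted to the coordinates in `support`."""
    theta = np.zeros(X.shape[1])
    if support:
        idx = sorted(support)
        theta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return theta

def foba_least_squares(X, y, eps_s, nu=0.5, max_rounds=1000):
    """Forward-backward greedy selection for the squared loss (illustrative)."""
    n, p = X.shape
    col_sq = np.einsum("ij,ij->j", X, X)          # squared column norms
    support, theta = set(), np.zeros(p)
    for _ in range(max_rounds):
        # Forward step: best single-coordinate decrease of the loss.
        resid = y - X @ theta
        gains = (X.T @ resid) ** 2 / (2.0 * n * col_sq)
        if support:
            gains[sorted(support)] = -np.inf      # only inactive coordinates
        j_star = int(np.argmax(gains))
        delta_f = gains[j_star]
        if delta_f < eps_s:                       # stopping criterion
            break
        support.add(j_star)
        theta = _refit(X, y, support)
        # Backward step: drop coordinates whose removal costs at most nu * delta_f.
        while len(support) > 1:
            base = _loss(X, y, theta)
            idx = sorted(support)
            increase = []
            for j in idx:
                trial = theta.copy()
                trial[j] = 0.0
                increase.append(_loss(X, y, trial) - base)
            worst = int(np.argmin(increase))
            if increase[worst] > nu * delta_f:
                break
            support.remove(idx[worst])
            theta = _refit(X, y, support)
    return theta, support
```

For a general loss such as (4), the closed-form gain and refit would be replaced by one-dimensional and restricted numerical minimisations, respectively; the control flow stays the same.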

We state the assumptions on the loss function under which sparsistency can be guaranteed. Let
us first recall the definition of restricted strong convexity from Negahban et al. [18].
Specifically, for a given set $\mathbb{S}$, the loss function is said to satisfy restricted
strong convexity (RSC) with parameter $\kappa_l$ if

$$\mathcal{L}(\theta + \Delta; Z_1^n) - \mathcal{L}(\theta; Z_1^n) - \langle \nabla \mathcal{L}(\theta; Z_1^n), \Delta \rangle \ge \frac{\kappa_l}{2} \|\Delta\|_2^2 \quad \text{for all } \Delta \in \mathbb{S}. \tag{5}$$

We can now define sparsity restricted strong convexity as follows. Specifically, we say that
the loss function $\mathcal{L}$ satisfies RSC$(k)$ with parameter $\kappa_l$ if it satisfies
RSC with parameter $\kappa_l$ for all sets $\mathbb{S} \subset \{1, \ldots, p\}$ such that
$\|\mathbb{S}\|_0 \le k$.

In contrast, we say the loss function satisfies restricted strong smoothness (RSS) with
parameter $\kappa_u$, for a given set $\mathbb{S}$, if

$$\mathcal{L}(\theta + \Delta; Z_1^n) - \mathcal{L}(\theta; Z_1^n) - \langle \nabla \mathcal{L}(\theta; Z_1^n), \Delta \rangle \le \frac{\kappa_u}{2} \|\Delta\|_2^2 \quad \text{for all } \Delta \in \mathbb{S}.$$

We can define RSS$(k)$ similarly: the loss function $\mathcal{L}$ satisfies RSS$(k)$ with
parameter $\kappa_u$ if it satisfies RSS with parameter $\kappa_u$ for all sets
$\mathbb{S} \subset \{1, \ldots, p\}$ such that $\|\mathbb{S}\|_0 \le k$, at all points
$\theta$ with $\|\theta\|_0 \le k$. Given any constants $\kappa_l$ and $\kappa_u$, and a
sample based loss function $\mathcal{L}$, we can typically use concentration based arguments
to obtain bounds on the sample size required so that the RSS and RSC conditions hold with high
probability. Another property of the loss function that we require is an upper bound
$\lambda_n$ on the $\ell_\infty$ norm of the gradient of the loss at the true parameter
$\theta^*$, i.e., $\lambda_n \ge \|\nabla \mathcal{L}(\theta^*)\|_\infty$. This captures the
"noise level" of the samples with respect to the loss. Here too, we can typically use
concentration arguments to show, for instance, that
$\lambda_n \le c_n (\log(p)/n)^{1/2}$ for some constant $c_n > 0$, with high probability.
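To make the RSC$(k)$ and RSS$(k)$ constants concrete, consider the squared loss $\mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$ (an assumption made only for this example). Its Taylor remainder is exactly $\frac{1}{2}\Delta^\top \hat{\Sigma} \Delta$ with $\hat{\Sigma} = X^\top X / n$, so the best constants are the extreme eigenvalues of the $k \times k$ principal submatrices of $\hat{\Sigma}$. The brute-force sketch below enumerates all supports of size $k$ and is therefore usable only for small $p$; the function name is hypothetical.

```python
import itertools
import numpy as np

def sparse_curvature_constants(X, k):
    """Best RSC(k)/RSS(k) constants (kappa_l, kappa_u) for the squared loss.

    By eigenvalue interlacing, supports of size exactly k attain the extremes
    over all supports of size at most k, so only those are enumerated.
    """
    n, p = X.shape
    gram = X.T @ X / n
    kappa_l, kappa_u = np.inf, 0.0
    for subset in itertools.combinations(range(p), k):
        eig = np.linalg.eigvalsh(gram[np.ix_(subset, subset)])  # ascending order
        kappa_l = min(kappa_l, eig[0])
        kappa_u = max(kappa_u, eig[-1])
    return float(kappa_l), float(kappa_u)
```

The ratio $\rho = \kappa_u / \kappa_l$ returned by such a computation is the quantity that enters the conditions of Theorem 1.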

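Theorem 1 (Sparsistency). Suppose the loss function $\mathcal{L}(\cdot)$ satisfies
RSC$(\eta s^*)$ and RSS$(\eta s^*)$ with parameters $\kappa_l$ and $\kappa_u$ for some
$\eta \ge 2 + 4\rho^2 \big(\sqrt{(\rho^2 - \rho)/s^*} + \sqrt{2}\big)^2$ with
$\rho = \kappa_u / \kappa_l$. Moreover, suppose that the true parameters $\theta^*$ satisfy
$\min_{j \in S^*} |\theta^*_j| > \sqrt{32 \rho \epsilon_S / \kappa_l}$. Then if we run
Algorithm 1 with stopping threshold $\epsilon_S \ge (8 \rho \eta / \kappa_l)\, s^* \lambda_n^2$,
the output $\hat{\theta}$ with support $\hat{S}$ satisfies:

(a) Error Bound: $\|\hat{\theta} - \theta^*\|_2 \le \frac{2}{\kappa_l} \sqrt{s^*} \big(\lambda_n \sqrt{\eta} + \sqrt{\epsilon_S} \sqrt{2 \kappa_u}\big)$.

(b) No False Exclusions: $S^* - \hat{S} = \emptyset$.

(c) No False Inclusions: $\hat{S} - S^* = \emptyset$.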
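The conditions of Theorem 1 are explicit functions of $\kappa_l$, $\kappa_u$, $s^*$ and $\lambda_n$. The small helper below evaluates the smallest admissible $\eta$, the corresponding smallest stopping threshold $\epsilon_S$, and the resulting lower bound required on $\min_j |\theta^*_j|$. The function name and the returned dictionary keys are assumptions made for illustration.

```python
import math

def theorem1_constants(kappa_l, kappa_u, s_star, lam_n):
    """Evaluate the thresholds appearing in the statement of Theorem 1."""
    rho = kappa_u / kappa_l
    # smallest eta allowed: 2 + 4 rho^2 (sqrt((rho^2 - rho)/s*) + sqrt(2))^2
    eta = 2.0 + 4.0 * rho**2 * (math.sqrt((rho**2 - rho) / s_star) + math.sqrt(2.0)) ** 2
    # smallest stopping threshold: (8 rho eta / kappa_l) s* lambda_n^2
    eps_s = 8.0 * rho * eta / kappa_l * s_star * lam_n**2
    # required minimum signal strength: sqrt(32 rho eps_S / kappa_l)
    theta_min = math.sqrt(32.0 * rho * eps_s / kappa_l)
    return {"rho": rho, "eta": eta, "eps_s": eps_s, "theta_min": theta_min}
```

For example, with $\kappa_l = \kappa_u = 1$ one gets $\rho = 1$ and $\eta = 2 + 4 \cdot 2 = 10$, so the stopping threshold must be at least $80\, s^* \lambda_n^2$.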



Proof. The proof of the theorem hinges on three main lemmas: Lemmas 5 and 7, which are simple
consequences of the forward and backward steps failing when the greedy algorithm stops, and
Lemma 6, which uses these two lemmas and extends techniques from [21] and [19] to obtain an
$\ell_2$ error bound. Provided these lemmas hold, we then show below that the greedy algorithm
is sparsistent. However, these lemmas require a priori that the RSC and RSS conditions hold
for sparsity size $|S^* \cup \hat{S}|$. Thus, we use the result in Lemma 8 that if
RSC$(\eta s^*)$ holds, then the solution when the algorithm terminates satisfies
$|\hat{S}| \le (\eta - 1) s^*$, and hence $|\hat{S} \cup S^*| \le \eta s^*$. Thus, we can then
apply Lemmas 5, 6 and 7 to complete the proof as detailed below.

(a) The result follows directly from Lemma 6, and noting that
$|\hat{S} \cup S^*| \le \eta s^*$. In that lemma, we show that the upper bound holds by drawing
from fixed point techniques in [21] and [19], and by using a simple consequence of the forward
step failing when the greedy algorithm stops.

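(b) Following the argument in [27], we use the chaining argument. For any
$\tau \in \mathbb{R}$, we have

$$\tau\, \big|\{j \in S^* - \hat{S} : |\theta^*_j|^2 \ge \tau\}\big| \le \|\theta^*_{S^* - \hat{S}}\|_2^2 \le \|\hat{\theta} - \theta^*\|_2^2 \le \frac{8 \eta s^* \lambda_n^2}{\kappa_l^2} + \frac{16 \kappa_u \epsilon_S}{\kappa_l^2}\, |S^* - \hat{S}|,$$

where the last inequality follows from part (a) and the inequality
$(a + b)^2 \le 2a^2 + 2b^2$. Now, setting $\tau = 32 \kappa_u \epsilon_S / \kappa_l^2$ and
dividing both sides by $\tau/2$ we get

$$2\, \big|\{j \in S^* - \hat{S} : |\theta^*_j|^2 \ge \tau\}\big| \le \frac{\eta s^* \lambda_n^2}{2 \kappa_u \epsilon_S} + |S^* - \hat{S}|.$$

Substituting
$|\{j \in S^* - \hat{S} : |\theta^*_j|^2 \ge \tau\}| = |S^* - \hat{S}| - |\{j \in S^* - \hat{S} : |\theta^*_j|^2 < \tau\}|$,
we get

$$|S^* - \hat{S}| \le 2\, \big|\{j \in S^* - \hat{S} : |\theta^*_j|^2 < \tau\}\big| + \frac{\eta s^* \lambda_n^2}{2 \kappa_u \epsilon_S} \le 2\, \big|\{j \in S^* - \hat{S} : |\theta^*_j|^2 < \tau\}\big| + 1/2,$$

due to the setting of the stopping threshold $\epsilon_S$. By our assumption on the size of
the minimum entry of $\theta^*$, we have $|\theta^*_j|^2 > 32 \rho \epsilon_S / \kappa_l = \tau$
for all $j \in S^*$, so the set on the right-hand side is empty. This in turn entails that
$|S^* - \hat{S}| \le 1/2$, and since $|S^* - \hat{S}|$ is an integer, $|S^* - \hat{S}| = 0$.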

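(c) From Lemma 7, which provides a simple consequence of the backward step failing when the
greedy algorithm stops, for $\hat{\Delta} = \hat{\theta} - \theta^*$, we have
$(\epsilon_S / \kappa_u)\, |\hat{S} - S^*| \le \|\hat{\theta}_{\hat{S} - S^*}\|_2^2 = \|\hat{\Delta}_{\hat{S} - S^*}\|_2^2 \le \|\hat{\Delta}\|_2^2$,
so that using Lemma 6 and that $|S^* - \hat{S}| = 0$, we obtain that
$|\hat{S} - S^*| \le \frac{4 \kappa_u \eta s^* \lambda_n^2}{\epsilon_S \kappa_l^2} \le 1/2$,
due to the setting of the stopping threshold $\epsilon_S$. Hence $|\hat{S} - S^*| = 0$.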



3.1 Lemmas for Theorem 1 

We list the simple lemmas that characterize the solution obtained when the algorithm 
terminates, and on which the proof of Theorem 1 hinges. 



Algorithm 2 Greedy forward-backward algorithm for pairwise discrete graphical model learning

Input: Data $D := \{x^{(1)}, \ldots, x^{(n)}\}$, Stopping Threshold $\epsilon_S$, Backward Step Factor $\nu \in (0,1)$
Output: Estimated Edges $\hat{E}$

for $r \in V$ do
    Run Algorithm 1 with $\mathcal{L}(\cdot)$ described by (4) to get $\hat{\theta}_r$ and its support $\hat{\mathcal{N}}_r$
end for
Output $\hat{E} = \cup_r \{(r,t) : t \in \hat{\mathcal{N}}_r\}$



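Lemma 1 (Stopping Forward Step). When Algorithm 1 stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\big|\mathcal{L}(\hat{\theta}) - \mathcal{L}(\theta^*)\big| < \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S}\; \|\hat{\theta} - \theta^*\|_2.$$

Lemma 2 (Stopping Backward Step). When Algorithm 1 stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\|\hat{\theta}_{\hat{S} - S^*}\|_2^2 \ge \frac{\epsilon_S}{\kappa_u}\, |\hat{S} - S^*|.$$

Lemma 3 (Stopping Error Bound). When Algorithm 1 stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\|\hat{\theta} - \theta^*\|_2 \le \frac{2}{\kappa_l} \Big( \lambda_n \sqrt{|S^* \cup \hat{S}|} + \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S} \Big).$$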

,2 / I I — \-2 

Lemma 4 (Stopping Size). If es > i^ \\/^i ^ \/~) '^"'^ RSC {r]s*) holds for 



some ?7 > 2 + 4p^ I y ^—r^ + \/2 j , then the algorithm 1 stops with k < (rj — 1) 



Notice that if e^ > {Spri/ni) [rj^ / [Ap^)) A^, then, the assumption of this lemma 
is satisfied. Hence for large value of s* > Sp^ > rj^ /{Ap^), it suffices to have ts > 

{Sprj/Ki) s*\l. 

4 Greedy Algorithm for Pairwise Graphical Models 

Suppose we are given a set of $n$ i.i.d. samples $D := \{x^{(1)}, \ldots, x^{(n)}\}$, drawn
from a pairwise Ising model as in (2), with parameters $\theta^*$ and graph $G = (V, E^*)$. It
will be useful to denote the maximum node-degree in the graph $E^*$ by $d$. As we will show,
our model selection performance depends critically on this parameter $d$. We then propose
Algorithm 2 for estimating the underlying graphical model from the $n$ samples $D$.



Theorem 2 (Pairwise Sparsistency). Suppose we run Algorithm 2 with stopping threshold
$\epsilon_S \ge c_1 \frac{d \log p}{n}$, where $d$ is the maximum node degree in the graphical
model, and the true parameters $\theta^*$ satisfy
$\frac{c_3}{\sqrt{d}} > \min_{j \in S^*} |\theta^*_j| > c_2 \sqrt{\epsilon_S}$, and further
that the number of samples scales as

$$n > c_4\, d^2 \log p,$$

for some constants $c_1, c_2, c_3, c_4$. Then, with probability at least
$1 - c' \exp(-c'' n)$, the output $\hat{\theta}$ supported on $\hat{S}$ satisfies:

(a) No False Exclusions: $E^* - \hat{E} = \emptyset$.

(b) No False Inclusions: $\hat{E} - E^* = \emptyset$.

Proof. This theorem is a corollary to our general Theorem 1. We first show that the
conditions of Theorem 1 hold under the assumptions in this corollary.

RSC, RSS. We first note that the conditional log-likelihood loss function in (4) corresponds
to a logistic likelihood. Moreover, the covariates are all binary, and bounded, and hence also
sub-Gaussian. [19, 2] analyze the RSC and RSS properties of generalized linear models, of
which logistic models are an instance, and show that the following result holds if the
covariates are sub-Gaussian. Let
$\delta\mathcal{L}(\Delta; \theta^*) = \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle$
be the second order Taylor series remainder. Then, Proposition 2 in [19] states that there
exist constants $\kappa_1^l$ and $\kappa_2^l$, independent of $n, p$, such that with
probability at least $1 - c_1 \exp(-c_2 n)$, for some constants $c_1, c_2 > 0$,

$$\delta\mathcal{L}(\Delta; \theta^*) \ge \kappa_1^l \|\Delta\|_2 \Big\{ \|\Delta\|_2 - \kappa_2^l \sqrt{\frac{\log(p)}{n}}\, \|\Delta\|_1 \Big\} \quad \text{for all } \Delta : \|\Delta\|_2 \le 1.$$

Thus, if $\|\Delta\|_0 \le k := \eta d$, then $\|\Delta\|_1 \le \sqrt{k}\, \|\Delta\|_2$, so that

$$\delta\mathcal{L}(\Delta; \theta^*) \ge \kappa_1^l \|\Delta\|_2^2 \Big\{ 1 - \kappa_2^l \sqrt{\frac{k \log(p)}{n}} \Big\} \ge \frac{\kappa_1^l}{2} \|\Delta\|_2^2$$

if $n > 4 (\kappa_2^l)^2\, \eta d \log(p)$. In other words, with probability at least
$1 - c_1 \exp(-c_2 n)$, the loss function $\mathcal{L}$ satisfies RSC$(k)$ with parameter
$\kappa_1^l$ provided $n > 4 (\kappa_2^l)^2\, \eta d \log(p)$. Similarly, it follows from
[19, 2] that there exist constants $\kappa_1^u$ and $\kappa_2^u$ such that with probability at
least $1 - c_1' \exp(-c_2' n)$,

$$\delta\mathcal{L}(\Delta; \theta^*) \le \kappa_1^u \|\Delta\|_2 \big\{ \|\Delta\|_2 - \kappa_2^u \|\Delta\|_1 \big\} \quad \text{for all } \Delta : \|\Delta\|_2 \le 1,$$

so that by a similar argument, with probability at least $1 - c_1' \exp(-c_2' n)$, the loss
function $\mathcal{L}$ satisfies RSS$(k)$ with parameter $\kappa_1^u$ provided
$n > 4 (\kappa_2^u)^2\, \eta d \log(p)$.

Noise Level. Next, we obtain a bound on the noise level
$\lambda_n \ge \|\nabla \mathcal{L}(\theta^*)\|_\infty$ following similar arguments to [20].
Let $W$ denote the gradient $\nabla \mathcal{L}(\theta^*)$ of the loss function (4). Any entry
of $W$ has the form $W_t = \frac{1}{n} \sum_{i=1}^{n} Z_{rt}^{(i)}$, where
$Z_{rt}^{(i)} = x_t^{(i)} \big( x_r^{(i)} - \mathbb{P}(x_r = 1 \mid x_{V \setminus r}^{(i)}) \big)$
are zero-mean, i.i.d. and bounded, $|Z_{rt}^{(i)}| \le 1$. Thus, an application of Hoeffding's
inequality yields that $\mathbb{P}[|W_t| > \delta] \le 2 \exp(-2 n \delta^2)$. Applying a union
bound over indices in $W$, we get
$\mathbb{P}[\|W\|_\infty > \delta] \le 2 \exp(-2 n \delta^2 + \log(p))$. Thus, if
$\lambda_n = (\log(p)/n)^{1/2}$, then $\|W\|_\infty \le \lambda_n$ with probability at least
$1 - 2 \exp(-2 n \lambda_n^2 + \log(p))$.

We can now verify that under the assumptions in the corollary, the conditions on the stopping
threshold $\epsilon_S$ and the minimum absolute value of the non-zero parameters
$\min_{j \in S^*} |\theta^*_j|$ are satisfied. Moreover, from the discussion above, under the
sample size scaling in the corollary, the required RSC and RSS conditions hold as well. Thus,
Theorem 1 yields that each node neighborhood is recovered with no false exclusions or
inclusions with probability at least $1 - c' \exp(-c'' n)$. An application of a union bound
over all nodes completes the proof. □

Remarks. The sufficient condition on the parameters imposed by the greedy algorithm is a
restricted strong convexity condition [19], which is weaker than the irrepresentable condition
required by [20]. Further, the number of samples required for sparsistent graph recovery
scales as $O(d^2 \log p)$, where $d$ is the maximum node degree, in contrast to
$O(d^3 \log p)$ for the $\ell_1$-regularized counterpart. We corroborate this in our
simulations, where we find that the greedy algorithm requires fewer observations than [20]
for sparsistent graph recovery.

We also note that the result can be extended to the general pairwise graphical model case,
where each random variable takes values in the range $\{1, \ldots, m\}$. In that case, the
conditional likelihood of each node conditioned on the rest of the nodes takes the form of a
multiclass logistic model, and the greedy algorithm would take the form of a "group"
forward-backward greedy algorithm, which would add or remove all the parameters corresponding
to an edge as a group. Our analysis naturally extends to such a group greedy setting as well.
The analysis for RSC and RSS remains the same; for bounds on $\lambda_n$, see equation (12) in
[15]. We defer further discussion on this due to the lack of space.



5 Experimental Results 

We now present experimental results that illustrate the power of Algorithm 2 and support our
theoretical guarantees. We simulated structure learning of several different graph structures
and compared the learning rates of our method against that of a standard $\ell_1$-logistic
regression method as outlined in [20].

We performed experiments using 3 different graph structures: (a) chain (line graph),
(b) 4-nearest neighbor (grid graph) and (c) star graph. For each experiment, we assumed a
pairwise binary Ising model in which each $\theta^*_{rt} = \pm 1$ randomly. For each graph
type, we generated a set of $n$ samples $x^{(1)}, \ldots, x^{(n)}$ using Gibbs sampling. We
then attempted to learn the structure of the model using both Algorithm 2 as well as
$\ell_1$-logistic regression. We then compared the actual graph structure with the empirically
learned graph structures. If the graph structures matched completely then we declared the
result a success; otherwise we declared the result a failure. We compared these results over a
range of sample sizes ($n$) and averaged the results for each sample size over a batch of size
10. For all greedy experiments we set the stopping threshold
$\epsilon_S = c\, \frac{\log(p)}{n}$, where $c$ is a tuning constant, as suggested by
Theorem 2, and set the backwards step threshold $\nu = 0.5$. For all $\ell_1$-logistic
regression experiments we set the regularization parameter
$\lambda_n = c' \sqrt{\log(p)/n}$, where $c'$ is set via cross-validation.

Fig 1: Plots of success probability
$\mathbb{P}[\hat{\mathcal{N}}_{\pm}(r) = \mathcal{N}^*(r), \forall r \in V]$ versus the
control parameter $\beta(n,p,d) = n/[20 d \log(p)]$ for Ising model on (a) chain ($d = 2$),
(b) 4-nearest neighbor ($d = 4$) and (c) Star graph ($d = 0.1p$). The coupling parameters are
chosen randomly from $\theta^*_{st} = \pm 0.50$ for both greedy and $\ell_1$-logistic
regression methods. As our theorem suggests and these figures show, the greedy algorithm
requires fewer samples to recover the exact structure of the graphical model.

Figure 1 shows the results for the chain ($d = 2$), grid ($d = 4$) and star ($d = 0.1p$)
graphs using both Algorithm 2 and $\ell_1$-logistic regression for three different graph sizes
$p \in \{36, 64, 100\}$ with mixed (random sign) couplings. For each sample size, we generated
a batch of 10 different graphical models and averaged the probability of success (complete
structure learned) over the batch. Each curve then represents the probability of success
versus the control parameter $\beta(n,p,d) = n/[20 d \log(p)]$, which increases with the
sample size $n$. These results support our theoretical claims and demonstrate the efficiency
of the greedy method in comparison to node-wise logistic regression [20].
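The experimental setup relies on Gibbs sampling from the Ising model (2) and on the control parameter $\beta(n,p,d)$. The sketch below shows one possible way to implement both; it assumes zero node parameters, a symmetric coupling matrix `theta_edge` with zero diagonal, and arbitrary illustrative values for the burn-in and thinning, none of which are specified in the text. Under (2) with $\pm 1$ coding, the conditional probability of $x_r = +1$ given the other variables is $1/(1 + \exp(-2 \sum_t \theta_{rt} x_t))$.

```python
import math
import numpy as np

def gibbs_sample_ising(theta_edge, n_samples, burn_in=200, thin=5, seed=0):
    """Draw samples in {-1, +1}^p from Eq. (2) with zero node parameters."""
    rng = np.random.default_rng(seed)
    p = theta_edge.shape[0]
    x = rng.choice([-1, 1], size=p)
    samples = []
    for sweep in range(burn_in + n_samples * thin):
        for r in range(p):
            field = theta_edge[r] @ x              # zero diagonal: excludes x_r itself
            prob_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[r] = 1 if rng.random() < prob_plus else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            samples.append(x.copy())               # keep every `thin`-th sweep
    return np.array(samples)

def control_parameter(n, p, d):
    """beta(n, p, d) = n / [20 d log(p)], the x-axis of Figure 1."""
    return n / (20.0 * d * math.log(p))
```

Samples produced this way are correlated across sweeps, so the burn-in and thinning values would need to be chosen with care in an actual replication.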
A Auxiliary Lemmas for Theorem 1

In this section, we prove the lemmas used in the proof of Theorem 1. Note that when the
algorithm terminates, the forward step fails to go through. This entails that

$$\mathcal{L}(\hat{\theta}) - \inf_{j \in \hat{S}^c,\ \alpha \in \mathbb{R}} \mathcal{L}(\hat{\theta} + \alpha e_j) < \epsilon_S. \tag{6}$$

The next lemma shows that this has the consequence of upper bounding the deviation in loss
between the estimated parameters $\hat{\theta}$ and the true parameters $\theta^*$.

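Lemma 5 (Stopping Forward Step). When the algorithm stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\big|\mathcal{L}(\hat{\theta}) - \mathcal{L}(\theta^*)\big| < \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S}\; \|\hat{\theta} - \theta^*\|_2. \tag{7}$$

Proof. Let $\hat{\Delta} = \theta^* - \hat{\theta}$. For any $\eta \in \mathbb{R}$, we have

$$\mathcal{L}(\hat{\theta} + \eta \hat{\Delta}_j e_j) \le \mathcal{L}(\hat{\theta}) + \eta\, \nabla_j \mathcal{L}(\hat{\theta})\, \hat{\Delta}_j + \eta^2\, \frac{\kappa_u}{2}\, \hat{\Delta}_j^2.$$

Thus, we can establish

$$-|S^* - \hat{S}|\, \epsilon_S < \sum_{j \in S^* - \hat{S}} \Big( \mathcal{L}(\hat{\theta} + \eta \hat{\Delta}_j e_j) - \mathcal{L}(\hat{\theta}) \Big) \le \eta\, \langle \nabla \mathcal{L}(\hat{\theta}), \hat{\Delta}_{S^* - \hat{S}} \rangle + \eta^2\, \frac{\kappa_u}{2}\, \|\hat{\Delta}_{S^* - \hat{S}}\|_2^2 \le \eta \big( \mathcal{L}(\theta^*) - \mathcal{L}(\hat{\theta}) \big) + \eta^2\, \frac{\kappa_u}{2}\, \|\hat{\Delta}\|_2^2.$$

Optimizing the RHS over $\eta$, we obtain

$$-|S^* - \hat{S}|\, \epsilon_S < -\frac{\big( \mathcal{L}(\theta^*) - \mathcal{L}(\hat{\theta}) \big)^2}{2 \kappa_u \|\hat{\Delta}\|_2^2},$$

whence the lemma follows. □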



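Lemma 6 (Stopping Error Bound). When the algorithm stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\|\hat{\theta} - \theta^*\|_2 \le \frac{2}{\kappa_l} \Big( \lambda_n \sqrt{|S^* \cup \hat{S}|} + \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S} \Big). \tag{8}$$

Proof. For $\Delta \in \mathbb{R}^p$, let

$$G(\Delta) = \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S}\; \|\Delta\|_2.$$

It can be seen that $G(0) = 0$, and from the previous lemma, $G(\hat{\Delta}) \le 0$ for
$\hat{\Delta} = \hat{\theta} - \theta^*$. Further, $G(\Delta)$ is sub-homogeneous (over a
limited range): $G(t\Delta) \le t\, G(\Delta)$ for $t \in [0, 1]$. Thus, for a carefully chosen
$r > 0$, if we show that $G(\Delta) > 0$ for all
$\Delta \in \{\Delta : \|\Delta\|_2 = r,\ \|\Delta\|_0 \le |\bar{S}|\}$, where
$\bar{S} = \hat{S} \cup S^*$, then it follows that $\|\hat{\Delta}\|_2 \le r$. If not, then
there would exist some $t \in [0, 1)$ such that $\|t \hat{\Delta}\|_2 = r$, whence we would
arrive at the contradiction

$$0 < G(t \hat{\Delta}) \le t\, G(\hat{\Delta}) \le 0.$$

Thus, it remains to show that $G(\Delta) > 0$ for all
$\Delta \in \{\Delta : \|\Delta\|_2 = r,\ \|\Delta\|_0 \le |\bar{S}|\}$. By the restricted
strong convexity property of $\mathcal{L}$, we have

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle + \frac{\kappa_l}{2} \|\Delta\|_2^2.$$

We can establish

$$\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle \ge -|\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle| \ge -\|\nabla \mathcal{L}(\theta^*)\|_\infty \|\Delta\|_1 \ge -\lambda_n \|\Delta\|_1,$$

and hence, using $\|\Delta\|_1 \le \sqrt{|S^* \cup \hat{S}|}\, \|\Delta\|_2$,

$$G(\Delta) \ge -\lambda_n \|\Delta\|_1 + \frac{\kappa_l}{2} \|\Delta\|_2^2 - \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S}\; \|\Delta\|_2 \ge \|\Delta\|_2 \Big( \frac{\kappa_l}{2} \|\Delta\|_2 - \lambda_n \sqrt{|S^* \cup \hat{S}|} - \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S} \Big) > 0,$$

if $\|\Delta\|_2 = r$ for any
$r > \frac{2}{\kappa_l} \big( \lambda_n \sqrt{|S^* \cup \hat{S}|} + \sqrt{2\, |S^* - \hat{S}|\, \kappa_u \epsilon_S} \big)$.
Since $r$ can be taken arbitrarily close to this value, the bound (8) follows. This concludes
the proof of the lemma. □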

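Next, we note that when the algorithm terminates, the backward step with the current
parameters has failed to go through. This entails that

$$\min_{j \in \hat{S}} \mathcal{L}(\hat{\theta} - \hat{\theta}_j e_j) - \mathcal{L}(\hat{\theta}) \ge \epsilon_S / 2. \tag{9}$$

The next lemma shows the consequence of this bound.

Lemma 7 (Stopping Backward Step). When the algorithm stops with parameter $\hat{\theta}$
supported on $\hat{S}$, we have

$$\|\hat{\theta}_{\hat{S} - S^*}\|_2^2 \ge \frac{\epsilon_S}{\kappa_u}\, |\hat{S} - S^*|. \tag{10}$$

Proof. We have

$$|\hat{S} - S^*| \min_{j \in \hat{S}} \mathcal{L}(\hat{\theta} - \hat{\theta}_j e_j) \le \sum_{j \in \hat{S} - S^*} \mathcal{L}(\hat{\theta} - \hat{\theta}_j e_j) \le |\hat{S} - S^*|\, \mathcal{L}(\hat{\theta}) + \frac{\kappa_u}{2} \|\hat{\theta}_{\hat{S} - S^*}\|_2^2,$$

where the second inequality uses the fact that $[\nabla \mathcal{L}(\hat{\theta})]_{\hat{S}} = 0$.
Substituting (9) above, the lemma follows. □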



B Lemmas on the Stopping Size 

Lemma 8. Ifes > ^ i ^^jj,^ - J^ ] and RSC ((2 + 7)^*) holds for 



some 7 > 4/9^ I \/ ^pr^ + V2 I , then the algorithm stops with fc < (1 + 7)^:*. 



Proof. Consider the first time the algorithm reaches $k = (1 + \gamma) k^* + 1$; then by
Lemmas 9 and 11, we have



k-l-k* ^ / |^(fc-i)-^*| ^ 2Ku^n^{Ku-Ki) ^ 2k^ ( A„ ^ / 2|^*-5(fe-i) 



k-l 



l^^'^'^u^l «2^|5(/c-i)u5*| ^' V^^^^ V \s*^si^-^)\ 



2«^ /(K^)^_£^ 



^ ^' K, Y V «! / f^i ^ 2ku I K , I 2fc* 



Vfc - 1 '^i \ V««e5 V fc + A:* - 1 



Hence, we get 



^v7-vv^ r^~< ^- 



\/l +7 V2 + 7 ^/k^ 



For $\gamma \ge 4\rho^2 \big( \sqrt{(\rho^2 - \rho)/k^*} + \sqrt{2} \big)^2$, the LHS is
positive and we arrive at a contradiction with the assumption on $\epsilon_S$. □

When the algorithm reaches the support size of $k$ at the beginning of the forward step,
i.e., we added the $k$-th variable to the support and the backward step did not remove any
variable, let $\hat{\theta}^{(k)}$ denote the current parameter and
$\hat{S}^{(k)} = \mathrm{Supp}(\hat{\theta}^{(k)})$ with $k = |\hat{S}^{(k)}|$. Let
$\theta^*$ be the target parameter (i.e., $\mathbb{E}[\nabla \mathcal{L}(\theta^*)] = 0$),
with $S^* = \mathrm{Supp}(\theta^*)$ and $k^* = |S^*|$. Lemmas 9, 10 and 11 follow along
similar lines to their counterparts in Lemmas 7, 5 and 6 respectively: the latter held when
the algorithm terminates, while the lemmas below hold at any iterate $\hat{\theta}^{(k)}$
where we have first added the $k$-th variable to the support. We provide their detailed proofs
for completeness.

Lemma 9 (General Backward Step). The first time the algorithm reaches a support size 

ofk > k*+4: ( -^ ) +1 at the beginning of the forward step, assuming RSC f |5'''^-' U S* 
holds, we have 






> 



|5'(fe i)_5'*| 2kuV ku -~i \ (k) 



(11) 



Proof. Under the assumption of the lemma, the immediate previous backward step has 
not gone through and hence. 



Consequently, we get 



inf L feC^) - W^^i) - ^ f^^'' 
■ik)_3, V -J V V 



<5' 



(k) 



> 



e(fe) 



\s(k-i) _ s*\J— < J2 c(0^''^ -e^f^e,)-c{e^''^) 



< 



< 



Ku \Uk) 

2 II s('=-i)-s* 



A(fe) 



where, A^^) = 6^-1 ,, - 0^''-^l This entails that 



'\S(^-^)-S*\^^k) 



A(fe) 



< 



^(fe-i) 

S(fc-i)_5. 



Thus, it suffices to show that || A^^) |L < ^\ (ku - nAsfK 

II 1 1 z K, y / 



From the forward step, we have 

£ I oXk-i 



inf c(e^^-'^+aeA^sfl 



Let $(j^*, \alpha^* \neq 0)$ be the optimizer of the equation above. Now, we have



1i 
2 



A(fc) 






< 



m 


2 k; 


aw 


2 


K( 


m 


3* 


2 




2 


2 


J. 
Hence, llA^'^)!! < ^^^

'II II 2 — 2ki 



Since 






< 



e^-a^ 



^ 



^ 



— Ki \ Hi t 



and we only need to show that 
a*|, we can equivalently control the latter two terms. 



First, by forward step construction, -^ Ic^*! ^ 'C 



Tfe-l) 



4" 



and hence la* I < 



—5f. Second, we claim that 



-C 



Xk-i) 



^- 2Ki, — K] I 

< — T — - a* 



and we are done. 



To the contrary, suppose

2 



2tfc) 



> 



2ku — 1^1 



laJ > ^ la* r. We have 

II K/ I I 



n{k) 



^ Km I ,2 






> — 






This is a contradiction provided that -y 



Later, we will show that Sign ( V j, C 



oik) 



aw 



\k-l) 



2 + T 

2 



V,- £ 






fe-i)^ ffltfe) 



V,- £ 



Tfe-i: 



fl(fe) 



Sign (a*) andKi|a*| < 



- a J > 0. 



V,.£ 



'(fc-i) 



K„|aJ. With these, if -^^ < 1, we have V,.£ (^('=-1 



fl(fc) 



claim follows. Otherwise, we have 






> 



^.) 



a^k) 



> and the 



so that 



n(k) 

2 



> ^^^ a* and hence. 



^ 



v,x(e^^-^ 



j(fe) A \ '*' 2k„ I 

■'* / 2 Kl 

= 0. 



fl(fe) 



«fe) 



< 



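To get the claimed properties of $\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})$, note that
\[
\frac{\kappa_l}{2}|\alpha^*|^2 \le \mathcal{L}(\theta^{(k-1)}) - \mathcal{L}(\theta^{(k-1)} + \alpha^* e_{j^*}) \le -\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\,\alpha^* - \frac{\kappa_l}{2}|\alpha^*|^2,
\]
and hence $\mathrm{Sign}\big(\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\big) = -\mathrm{Sign}(\alpha^*)$ and $\kappa_l|\alpha^*| \le \big|\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\big|$. Also, we can establish
\[
\frac{\kappa_u}{2}|\alpha^*|^2 \ge \mathcal{L}(\theta^{(k-1)}) - \mathcal{L}(\theta^{(k-1)} + \alpha^* e_{j^*}) \ge -\frac{\kappa_u}{2}|\alpha^*|^2 - \nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\,\alpha^*.
\]
Since $-\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\,\alpha^* > 0$, we can conclude that
\[
\big|\nabla_{j^*}\mathcal{L}(\theta^{(k-1)})\big| \le \kappa_u|\alpha^*|.
\]
This concludes the proof of the lemma. □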



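Lemma 10 (General Forward Step). The first time the algorithm reaches a support size of $k$ at the beginning of the forward step, we have
\[
\big|\mathcal{L}(\theta^*) - \mathcal{L}(\theta^{(k-1)})\big| \le \sqrt{2\,|S^* - S^{(k-1)}|\,\kappa_u\,\delta_f^{(k)}}\;\big\|\theta^* - \theta^{(k-1)}\big\|_2.
\]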



Proof. Under the assumption of the lemma, we have
\[
\mathcal{L}(\theta^{(k-1)}) - \inf_{j,\,\alpha} \mathcal{L}(\theta^{(k-1)} + \alpha e_j) = \delta_f^{(k)}.
\]

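For any $\eta \in \mathbb{R}$, we have
\[
|S^* - S^{(k-1)}|\,\delta_f^{(k)} \ge \eta\big(\mathcal{L}(\theta^{(k-1)}) - \mathcal{L}(\theta^*)\big) - \frac{\kappa_u}{2}\,\eta^2\,\big\|\theta^* - \theta^{(k-1)}\big\|_2^2.
\]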



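Optimizing the RHS over $\eta$, we obtain
\[
2\kappa_u\,\big\|\theta^* - \theta^{(k-1)}\big\|_2^2\;|S^* - S^{(k-1)}|\,\delta_f^{(k)} \ge \big(\mathcal{L}(\theta^*) - \mathcal{L}(\theta^{(k-1)})\big)^2.
\]
This concludes the proof of the lemma. □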
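The step "optimizing the RHS over $\eta$" amounts to maximizing a concave quadratic. As a worked sketch, assume the right-hand side has the form $\eta a - \frac{\kappa_u}{2}\eta^2 b$, where $a$ and $b$ are shorthand introduced here (not in the original) for $a = \mathcal{L}(\theta^{(k-1)}) - \mathcal{L}(\theta^*)$ and $b = \|\theta^* - \theta^{(k-1)}\|_2^2$:

```latex
\[
\frac{d}{d\eta}\Big(\eta a - \frac{\kappa_u}{2}\eta^2 b\Big) = a - \kappa_u \eta b = 0
\;\Longrightarrow\; \eta = \frac{a}{\kappa_u b},
\qquad
\max_{\eta \in \mathbb{R}} \Big(\eta a - \frac{\kappa_u}{2}\eta^2 b\Big)
= \frac{a^2}{\kappa_u b} - \frac{a^2}{2\kappa_u b}
= \frac{a^2}{2\kappa_u b}.
\]
```

Under that assumption, rearranging $|S^* - S^{(k-1)}|\,\delta_f^{(k)} \ge a^2/(2\kappa_u b)$ and taking square roots gives the bound stated in the lemma.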



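Lemma 11 (General Error Bound). The first time the algorithm reaches a support size of $k$ at the beginning of the forward step, assuming RSC$\big(|S^{(k-1)} \cup S^*|\big)$ holds, we have
\[
\big\|\theta^{(k-1)} - \theta^*\big\|_2^2 \le \frac{4\kappa_u\,|S^* \cup S^{(k-1)}|\,\delta_f^{(k)}}{\kappa_l^2}\left(\frac{\lambda_n}{\sqrt{\kappa_u \epsilon_S}} + \sqrt{\frac{2\,|S^* - S^{(k-1)}|}{|S^* \cup S^{(k-1)}|}}\right)^2.
\]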



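Proof. Let
\[
G(\Delta) := \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \sqrt{2\,|S^* - S^{(k-1)}|\,\kappa_u\,\delta_f^{(k)}}\;\|\Delta\|_2.
\]
It can be seen that $G(0) = 0$, and from Lemma 10, $G(\theta^{(k-1)} - \theta^*) \le 0$. Further, $G(\Delta)$ is sub-homogeneous (over a limited range): $G(t\Delta) \le tG(\Delta)$ for $t \in [0, 1]$. Thus, for a carefully chosen $r > 0$, if we show that $G(\Delta) > 0$ for all $\Delta \in \{\Delta : \|\Delta\|_2 = r,\ \|\Delta\|_0 \le |S|\}$, where $S = S^{(k-1)} \cup S^*$, then it follows that $\|\theta^{(k-1)} - \theta^*\|_2 \le r$. If not, then there would exist some $t \in [0, 1)$ such that $\|t(\theta^{(k-1)} - \theta^*)\|_2 = r$, whence we would arrive at the contradiction
\[
0 < G\big(t(\theta^{(k-1)} - \theta^*)\big) \le tG\big(\theta^{(k-1)} - \theta^*\big) \le 0.
\]
Thus, it remains to show that $G(\Delta) > 0$ for all $\Delta \in \{\Delta : \|\Delta\|_2 = r,\ \|\Delta\|_0 \le |S|\}$. By RSC, we have
\[
\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \nabla\mathcal{L}(\theta^*) \cdot \Delta + \frac{\kappa_l}{2}\|\Delta\|_2^2.
\]
We can establish
\[
\nabla\mathcal{L}(\theta^*) \cdot \Delta \ge -|\nabla\mathcal{L}(\theta^*) \cdot \Delta| \ge -\|\nabla\mathcal{L}(\theta^*)\|_\infty \|\Delta\|_1 = -\lambda_n\|\Delta\|_1,
\]
and hence, using $\|\Delta\|_1 \le \sqrt{|S|}\,\|\Delta\|_2$,
\[
G(\Delta) \ge -\lambda_n\|\Delta\|_1 + \frac{\kappa_l}{2}\|\Delta\|_2^2 - \sqrt{2\,|S^* - S^{(k-1)}|\,\kappa_u\,\delta_f^{(k)}}\;\|\Delta\|_2
\]
\[
\ge \|\Delta\|_2\left(\frac{\kappa_l}{2}\|\Delta\|_2 - \lambda_n\sqrt{|S^* \cup S^{(k-1)}|} - \sqrt{2\,|S^* - S^{(k-1)}|\,\kappa_u\,\delta_f^{(k)}}\right) > 0,
\]
if $\|\Delta\|_2 = r$ for any
\[
r > \frac{2}{\kappa_l}\left(\lambda_n\sqrt{|S^* \cup S^{(k-1)}|} + \sqrt{2\,|S^* - S^{(k-1)}|\,\kappa_u\,\delta_f^{(k)}}\right).
\]
Since $r$ can be taken arbitrarily close to this threshold, we obtain
\[
\big\|\theta^{(k-1)} - \theta^*\big\|_2^2 \le \frac{4\kappa_u\,|S^* \cup S^{(k-1)}|\,\delta_f^{(k)}}{\kappa_l^2}\left(\frac{\lambda_n}{\sqrt{\kappa_u \delta_f^{(k)}}} + \sqrt{\frac{2\,|S^* - S^{(k-1)}|}{|S^* \cup S^{(k-1)}|}}\right)^2.
\]
Finally, consider the fact that $\delta_f^{(k)} \ge \epsilon_S$. This concludes the proof of the lemma. □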
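The closing algebra of this proof can be sanity-checked numerically. The sketch below assumes the threshold radius is $r = \frac{2}{\kappa_l}\big(\lambda_n\sqrt{|S^* \cup S^{(k-1)}|} + \sqrt{2|S^* - S^{(k-1)}|\kappa_u\delta_f^{(k)}}\big)$, the value at which the bracket in the lower bound on $G$ vanishes. It confirms that $r^2$ equals the factored form with prefactor $4\kappa_u|S^* \cup S^{(k-1)}|\delta_f^{(k)}/\kappa_l^2$. The function names and the numeric inputs are hypothetical.

```python
import math

def radius(kappa_l, kappa_u, lam, delta_f, n_missing, n_union):
    """Threshold radius: (2/kappa_l) * (lam*sqrt(n_union) + sqrt(2*n_missing*kappa_u*delta_f)).

    n_missing stands for |S* - S^(k-1)| and n_union for |S* U S^(k-1)|.
    """
    return (2.0 / kappa_l) * (lam * math.sqrt(n_union)
                              + math.sqrt(2.0 * n_missing * kappa_u * delta_f))

def factored_bound(kappa_l, kappa_u, lam, delta_f, n_missing, n_union):
    """Same quantity squared, with kappa_u * n_union * delta_f factored out."""
    prefactor = 4.0 * kappa_u * n_union * delta_f / kappa_l ** 2
    inner = lam / math.sqrt(kappa_u * delta_f) + math.sqrt(2.0 * n_missing / n_union)
    return prefactor * inner ** 2

# Hypothetical values, chosen only to exercise the identity.
args = dict(kappa_l=0.5, kappa_u=2.0, lam=0.1, delta_f=0.03, n_missing=2, n_union=7)
print(radius(**args) ** 2, factored_bound(**args))
```

The two printed numbers agree up to floating-point rounding. Replacing $\delta_f^{(k)}$ by the smaller $\epsilon_S$ inside the first term of the bracket can only increase the bound, which is the role of the final remark in the proof.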